{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[View the runnable example on GitHub](https://github.com/intel-analytics/BigDL/tree/main/python/nano/tutorial/notebook/training/tensorflow/tensorflow_custom_training_multi_instance.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Accelerate TensorFlow Keras Customized Training Loop Using Multiple Instances\n",
    "\n",
    "BigDL-Nano provides a decorator `nano` (optionally combined with `nano_multiprocessing` and `nano_multiprocessing_loss`) that enables multi-instance training for Keras models with a customized training loop."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To use multiple instances for TensorFlow Keras training, you need to install BigDL-Nano for TensorFlow (or Intel-TensorFlow):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# install the nightly-built version of bigdl-nano for TensorFlow\n",
    "!pip install --pre --upgrade bigdl-nano[stock_tensorflow_29,inference]\n",
    "!source bigdl-nano-init  # set environment variables"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 📝 **Note**\n",
    ">\n",
    "> Before starting your TensorFlow Keras application, it is highly recommended to run `source bigdl-nano-init` to set several environment variables based on your current hardware. Empirically, these variables significantly improve training performance for most TensorFlow Keras applications."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ⚠️ **Warning**\n",
    "> \n",
    "> For Jupyter Notebook users, we recommend running the commands above, especially `source bigdl-nano-init`, before the Jupyter kernel is started; otherwise some of the optimizations may not take effect."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> ⚠️ **Warning**\n",
    "> \n",
    "> Some of the optimized malloc implementations applied by `source bigdl-nano-init` **may** cause memory leaks. This can be avoided by running `unset LD_PRELOAD` and `unset MALLOC_CONF`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, define a dummy dataset and model for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.nano.tf.keras import nano_multiprocessing, nano\n",
    "import tensorflow as tf\n",
    "\n",
    "tf.random.set_seed(0)\n",
    "global_batch_size = 32\n",
    "\n",
    "model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])\n",
    "optimizer = tf.keras.optimizers.SGD()\n",
    "loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)\n",
    "\n",
    "dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(128).batch(\n",
    "    global_batch_size)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic usage for multi-process training on customized loop"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For customized training, users define their own `train_step` (typically a `tf.function`) with custom gradient calculation and weight update logic, as well as a training loop (e.g., `train_whole_data` in the code block below) that iterates over the full dataset. For details, refer to the [TensorFlow guide on writing a training loop from scratch](https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch)."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To run them in multiple processes, you only need to add 2 lines of code.\n",
    "\n",
    "- Add `@nano_multiprocessing` to the `train_step` function, which computes and applies the gradients.\n",
    "- Add `@nano(num_processes=...)` to the training loop function, which iterates over the full dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "@nano_multiprocessing  # <-- Just remove this line to run on 1 process\n",
    "@tf.function\n",
    "def train_step(inputs, model, loss_object, optimizer):\n",
    "    features, labels = inputs\n",
    "    with tf.GradientTape() as tape:\n",
    "        predictions = model(features, training=True)\n",
    "        loss = loss_object(labels, predictions)\n",
    "    gradients = tape.gradient(loss, model.trainable_variables)\n",
    "    optimizer.apply_gradients(zip(gradients, model.trainable_variables))\n",
    "    return loss\n",
    "\n",
    "@nano(num_processes=2)  # <-- Just remove this line to run on 1 process\n",
    "def train_whole_data(model, dataset, loss_object, optimizer, train_step):\n",
    "    for inputs in dataset:\n",
    "        print(train_step(inputs, model, loss_object, optimizer))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then run your training loop function as usual; the training will run across several processes (2 in this case) that work collaboratively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_whole_data(model, dataset, loss_object, optimizer, train_step)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> 📝 **Note**\n",
    ">\n",
    "> When `num_processes` is set, CPU cores are automatically and evenly distributed among the processes to avoid conflicts and maximize training throughput.\n",
    "> \n",
    "> During Nano TensorFlow Keras multi-instance training, the effective batch size is still the `batch_size` specified in the dataset (32 in this example), because we match the semantics of TensorFlow distributed training (`MultiWorkerMirroredStrategy`), which splits each batch into multiple sub-batches for the different workers."
   ]
  },
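  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the batch-size semantics described in the note above, the following sketch shows how one global batch could be divided into per-process sub-batches. It is plain arithmetic only: `per_process_batch_sizes` is a hypothetical helper that is not part of BigDL-Nano or TensorFlow, and the even split (with any remainder given to the first processes) is an assumption made for illustration. The actual sharding is handled internally by the framework."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper for illustration only (not a BigDL-Nano or TensorFlow API).\n",
    "def per_process_batch_sizes(batch_size, num_processes):\n",
    "    # Split one global batch as evenly as possible across the processes.\n",
    "    base, remainder = divmod(batch_size, num_processes)\n",
    "    # Assumption: the first `remainder` processes take one extra sample each.\n",
    "    return [base + 1 if i < remainder else base for i in range(num_processes)]\n",
    "\n",
    "# With a global batch of 32 and 2 processes, each process handles 16 samples.\n",
    "print(per_process_batch_sizes(batch_size=32, num_processes=2))"
   ]
  },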
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced usage for customized loss\n",
    "\n",
    "Sometimes users define their own loss function instead of using a predefined Keras loss. We provide a `nano_multiprocessing_loss` decorator to support such custom loss functions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras import backend\n",
    "from bigdl.nano.tf.keras import nano_multiprocessing_loss\n",
    "\n",
    "@nano_multiprocessing_loss()\n",
    "def loss_object(x, pred):\n",
    "    res = backend.mean(tf.math.squared_difference(x, pred), axis=-1)\n",
    "    return res"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_whole_data(model, dataset, loss_object, optimizer, train_step)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced usage for data generator\n",
    "\n",
    "Data generators are frequently used when data must be generated in real time or read from a large number of files. In this case, define the generator as a `tf.data.Dataset` via `from_generator`, and additionally call `dataset._GeneratorState = dataset._GeneratorState(generator)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def dummy_data_generator():\n",
    "    for i in range(128):\n",
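    "        # yield float32 tensors to match the output_signature below\n",
    "        yield tf.constant([i], dtype=tf.float32), tf.constant([i], dtype=tf.float32)\n",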
    "\n",
    "dataset = tf.data.Dataset.from_generator(\n",
    "    dummy_data_generator,\n",
    "    output_signature=(tf.TensorSpec(shape=(1,), dtype=tf.float32),\n",
    "                      tf.TensorSpec(shape=(1,), dtype=tf.float32)))\n",
    "\n",
    "# required: initialize dataset._GeneratorState with the generator\n",
    "dataset._GeneratorState = dataset._GeneratorState(dummy_data_generator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_whole_data(model, dataset, loss_object, optimizer, train_step)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "chronos",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "f7cbcfcf124497a723b2fc91b0dad8cd6ed41af955928289a9d3478af9690021"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
